…ces for stability in multi-domain constructs. Update logic to ensure proper handling of segment boundaries and improve clarity in comments.
Hey @ALGW71 - hope you're well! Have you had the chance to check this out yet? Thanks!

Hi @eliottpark, so sorry about this, it has been extremely busy with my new job and life stuff. When I test this on the SCFV file (https://github.com/oxpig/ANARCII/blob/main/notebook/scfv_testing.ipynb), it causes 11 SCFV sequences in the PDB to miss the 23-Cys, whereas these are captured by the old code.
The original code misses 6 chains (of an SCFV pair). See files:
But then it falls down on the use cases found by @eliottpark in #116 and #118.
In its current form I cannot justify merging this PR #119, nor magic number #117. The original code still performs best on the PDB, which is the use case for the three SAbDab musketeers below...
@benjaminhwilliams @HenrietteCapel @ovavourakis thoughts? My feeling is that each of you has a different use case...
@eliottpark if this is working for you then you must just use this for your stuff, but we will not merge.

I think @benjaminhwilliams @ovavourakis @HenrietteCapel need to look at #116 and assess how often those types of sequences are found in the PDB. If they are not real-world examples then what @eliottpark has found does not warrant a fix: it is something specific to his company, and until they start filling the PDB with seqs of this type it is not an OPIG issue. However, if they are found in reality, then we can spend some time designing a fix that solves the issue without compromising performance on PDB/SAbDab.
This PR fixes ScFv (“multi-domain”) window selection for repetitive antibody constructs (e.g., VH+CH1+linker+VH+CH1 with identical repeats listed in issue #118) where the second variable region could be missed unless a mutation broke scoring ties.
Key changes in SequenceProcessor._handle_long_sequences() ScFv mode:
Segment-relative indexing for minima/peak detection to avoid instability from global .index() behavior on repeated score values.
Peak search across the full interval between minima boundaries, rather than only the first ~50 residues after a boundary, so peaks that occur later in long CH/linker regions are still detected.
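To illustrate the first change, here is a minimal sketch of why a global `.index()` lookup can be unstable when score values repeat. The score list and both helper names are hypothetical, made up for illustration; this is not the ANARCII implementation.

```python
# Hypothetical per-position scores for a construct with two identical repeats.
scores = [5, 1, 9, 3, 5, 1, 9, 3]


def global_argmin(scores, start, end):
    """Fragile: looks up the segment minimum in the whole list.

    list.index() returns the FIRST occurrence of the value, so a repeated
    score resolves to the earlier repeat rather than the current segment.
    """
    return scores.index(min(scores[start:end]))


def segment_argmin(scores, start, end):
    """Stable: finds the position inside the segment, then adds the offset."""
    segment = scores[start:end]
    return start + segment.index(min(segment))


second_repeat = (4, 8)
print(global_argmin(scores, *second_repeat))   # 1 (points into the first repeat)
print(segment_argmin(scores, *second_repeat))  # 5 (stays inside the second repeat)
```

With identical repeats the two lookups disagree. A single mutation in one repeat would change a score and break the tie, which is consistent with the behaviour described in the PR summary.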
Result: ScFv splitting is now deterministic and reliably identifies downstream variable regions in repeated-domain sequences, producing correct offsets/windows for subsequent numbering.
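The second change, searching the whole interval between boundaries instead of a short window, can be sketched in the same spirit. The function name, the `max_lookahead` parameter and the toy scores are assumptions for illustration only; the cap of 50 mirrors the "~50 residues" mentioned above and is not taken from the actual code.

```python
def peak_in_interval(scores, left, right, max_lookahead=None):
    """Return the index of the highest score in [left, right).

    If max_lookahead is given, only that many positions after `left`
    are searched (the capped behaviour); otherwise the full interval is.
    """
    end = right if max_lookahead is None else min(right, left + max_lookahead)
    segment = scores[left:end]
    if not segment:
        return None
    # Segment-relative argmax, shifted back to a global index.
    best = max(range(len(segment)), key=segment.__getitem__)
    return left + best


# Toy profile: a long flat stretch (e.g. CH1 + linker) with a peak at position 90.
scores = [0.1] * 120
scores[90] = 0.9

capped = peak_in_interval(scores, 0, 120, max_lookahead=50)
full = peak_in_interval(scores, 0, 120)
print(capped, full)  # 0 90
```

In this toy case the capped search never reaches position 90 and falls back to the first flat position, while the full-interval search finds the peak.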